Gathering insights about COVID19 Worldwide Testing using Linear Mixed Models

UWF IDC6940 Capstone Project

Alex Ottley and Autumn Clark (Advisor: Dr. Cohen)

2025-04-23

Introduction

Understanding Linear Mixed Models (LMMs)

Linear Mixed Models (LMMs) are statistical models that extend traditional linear regression by including both:

  • Fixed effects: These are the overall effects you’re interested in studying (e.g., how testing volume affects case counts across all countries).

  • Random effects: These account for variation within subgroups or clusters (e.g., differences between countries), allowing intercepts and/or slopes to vary by group.

Understanding Linear Mixed Models (LMMs) cont.

LMMs are especially useful when:

  • Data is hierarchical or nested (like days within countries),
  • Measurements are repeated over time,
  • The data is unbalanced (not the same number of observations for each group).

Software Packages used for LMMs

  • LMMs are widely used for analyzing data with both fixed and random effects, especially in fields involving repeated measures or hierarchical data structures.
  • A major advancement in the implementation of LMMs in R is the development of the lme4 package, documented by Bates et al.(2015) (Bates et al. 2005). This package provides a framework for specifying and fitting mixed-effects models, using sparse matrix methods and numerical optimization to handle complex random-effects structures.
# loading packages 
library(tidyverse)
library(knitr)
library(ggthemes)
library(ggrepel)
library(dslabs)
library(lme4)
library(nlme)
library(dplyr)
library(ggplot2)
library(readr)

Why Linear Mixed Models (LMMs)?

  • Suitable for the data set we chose
  • Daily observations (such as test counts, positive cases, positivity rates, etc) for multiple countries across a span of time (multiple observations over time)
    • This results in hierarchical or nested data, that is repeated measures nested within countries.

Methods

Model Structure and Notation

The general form of a linear mixed model is: \[ y_{ij} = \beta_0 + \beta_1 x_{ij} + b_{0j} + b_{1j}x_{ij} + \varepsilon_{ij} \] Where,

  • \(y_{ij}\): response variable for observation in \(i\) in group \(j\)

  • \(x_{ij}\): predictor (e.g., scaled daily tests)

  • \(\beta_0,\beta_1\): fixed intercept and slope (population-level)

  • \(b_{0j}, b_{1j}\): random intercept and slope for group \(j\) (e.g., country)

  • \(\varepsilon_{ij}\): residual error, assumed to be normally distributed: \(\varepsilon_{ij} \sim N (0, \sigma^2)\)

Model Structure and Notation cont.

In matrix form, the LMM can be represented as:

\[ y = X\beta + Zb + \varepsilon \] Where,

  • \(y\): vector of outcomes

  • \(X\): fixed-effects design matrix

  • \(\beta\): vector of fixed effect coefficients

  • \(Z\): random-effects design matrix

  • \(b\): vector of random effects, assumed \(\sim N (0, G)\)

  • \(\varepsilon\): residual errors, assumed \(\sim N(0, R)\)

Assumptions

  1. Linearity - the relationship between predictor (fixed effects) and response variable are linear.
  2. Independence of Residual Errors - residual errors of a model are assumed to be independent within each level of the random effect structure
  3. Normality of Residual Errors and Random Effects - Residual errors and random effects are assumed to be normally distributed.
  4. Homoscedasticity, the variance of the residuals, should be constant across all levels of the independent variables.

Capstone Project Data Overview

Data Import

  • Our data comes from a Covid19 Worldwide Testing dataset on Kaggle. It includes daily test counts, daily positive cases, and other variables across different countries over time.
# Load Data
library(readr)
COVID19 <- read_csv("tested_worldwide.csv")
kable(head(COVID19))
Date Country_Region Province_State positive active hospitalized hospitalizedCurr recovered death total_tested daily_tested daily_positive
2020-01-16 Iceland All States 3 NA NA NA NA NA NA NA NA
2020-01-17 Iceland All States 4 NA NA NA NA NA NA NA 1
2020-01-18 Iceland All States 7 NA NA NA NA NA NA NA 3
2020-01-20 South Korea All States 1 NA NA NA NA NA 4 NA NA
2020-01-22 United States All States 0 NA NA NA NA 0 0 NA NA
2020-01-22 United States Massachusetts 0 NA NA NA NA 0 0 NA NA
df <- COVID19

Application to COVID-19 Data

  • A study was conducted to employ linear mixed models to examine the relationship between the number of daily COVID-19 positive cases reported across various countries over time.The data set consists of repeated daily observations for each country, which results in multilevel structure where each country is considered a separate group with its own data points over time. This hierarchical structure gives the opportunity to account for both the overall trends in the data and the variability across different countries.

  • The model was designed to account for both the fixed effects of the number of daily tests conducted, which can influence case detection, and the random effects that capture country-specific deviations in testing practices and positivity rates. The model allows the both the intercept, baseline positivity, and the slope, the effect of testing on reported cases, to vary by country.

Exploratory Data Analysis

  • Before modeling, we cleaned the data by removing rows with missing values in the daily test count, daily positive cases, and region columns.
  • We also scaled the test count variable by dividing it by 1,000 for easier interpretation.
df <- COVID19
#Convert Date to Date format
df$Date <- as.Date(df$Date)

#Remove rows with missing values in relevant columns
df_clean <- df %>%
  filter(!is.na(daily_positive), !is.na(daily_tested), !is.na(Country_Region))

#Rescale daily_tested column to improve interpretation and convergence
df_clean$daily_tested_scaled <- df_clean$daily_tested / 1000

Analysis & Results

Modeling

#Fit LMM with random intercept and slope
model_refined <- lmer(daily_positive ~ daily_tested_scaled + (daily_tested_scaled | Country_Region), data = df_clean)

Key components of the model

  • Response variable: daily_positive — the number of newly confirmed cases per day
  • Fixed effect predictor: daily_tested_scaled — the number of daily tests conducted (scaled by 1,000)
  • Random effects: Intercept and slope for each Country_Region, allowing both baseline positivity and test effectiveness to vary by country

The model formula in R was specified as: lmer(daily_positive ~ daily_tested_scaled + (daily_tested_scaled | Country_Region), data = df_clean)

  • This model allows each country to have its own baseline number of positive cases and its own relationship between testing volume and positivity.

Fixed Effects Interpretation

#Extract and print key fixed effects
fixef(model_refined) #estimates for population-level intercept and slope
        (Intercept) daily_tested_scaled 
           95.05436            52.16033 

Random Effects Interpretation

VarCorr(model_refined)
 Groups         Name                Std.Dev. Corr  
 Country_Region (Intercept)          659.941       
                daily_tested_scaled   64.935 -0.423
 Residual                           1725.488       

Visual Insights

Visualizations of fitted model predictions against observed data showed that:

  • The relationship between testing and positivity varies across countries.

  • Some countries exhibit steeper slopes (e.g., South Korea), while others show flatter or more variable trends (e.g., United States).

  • This validates the use of a random slope model, as a single global slope would fail to capture such heterogeneity.

Conclusion

  • Linear mixed models are powerful tools for analyzing complex, time-based public health data, capturing both global trends and country-specific differences.
  • This project helped us understand global COVID-19 testing patterns, showing that the number of tests is a strong predictor of detected cases, but the effect size varies by country.
  • It also shows the value of LMMs over simpler models, like linear regression, when working with complex, international health data.

References

Bates, Douglas et al. 2005. “Fitting Linear Mixed Models in r.” R News 5 (1): 27–30.